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1 .  Introduction 


In  regression  analysis,  the  response  y  is  often  transformed  for  two 
distinct  purposes,  to  induce  normally  distributed,  homoscedast ic  errors 
and  to  improve  the  fit  to  some  simple  model  involving  an  explanatory 
variable  x.  In  many  situations,  however,  y  is  already  believed  to  fit  a 
known  model  f(x,/J),  p  being  a  p-dimensional  parameter.  If  a 
transformation  of  y  is  still  needed  to  remove  skewness  and/or 
heteroscedasticity ,  then  this  model  can  be  preserved  by  transforming  y 
and  f  (x,/J)  in  the  same  manner.  Specifically,  let  y^A*  be  a 
transformation  indexed  by  the  parameter  A  and  assume  that  for  some  value 


(A)  ( A )  .. 

y  j  =  f  ( x  ,/3 )  +  oe.  . 


where  t . t  are  independent  and  at  least  approximately  normally 

1  N 

distributed  with  variance  1.  Notice  the  difference  between  (1)  and  the 
usual  approach  of  transforming  only  the  response,  not  f(x./5),  i.  e., 


y(A  *  =  f  (x,/J)  *  at  .  . 


It  should  be  emphasized  that  model  (1)  is  not  intended  as  a  substitute 
for  (2).  Both  models  are  appropriate,  but  under  quite  different 
circumstances.  Model  (2)  has  been  amply  discussed  by  Box  and  Cox  (1964) 
and  others,  e.  g..  Draper  and  Smith  (1980).  Cook  and  Weisberg  (1982)  and 
Carroll  and  Ruppert  (1981). 


v  '  aJv- 


\  in (,Tf  9 y 


.i  J  or 


>Vl,VA 
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Typically.  in  Model  (2),  f(x,£)  is  linear  but  in  principle 
nonlinear  models  can  be  used.  Model  (1).  which  we  call  "transform  both 
sides ",  has  been  investigated  by  Carroll  and  Ruppert  (1984),  Snee 
(1986),  and  Ruppert  and  Carroll  (1986)  and  we  will  only  summarize  those 
discussions.  According  to  (1),  f(x,/J)  has  two  closely  related 
interpretations;  f(x,/J)  is  the  value  of  y  when  the  error  is  zero  and  it 
is  the  median  of  the  conditional  distribution  of  y  given  x.  In  Carroll 
and  Ruppert  (1984),  we  were  concerned  with  situations  where  a  physical 
or  biological  model  provides  f(x,/5),  but  where  the  error  structure  is  a 
priori  unknown.  Examples  by  Snee  (1986),  Carroll  and  Ruppert  (1984), 
Ruppert  and  Carroll  (1985),  and  Bates,  Wolf,  and  Watts  (1985)  show  that 
transforming  both  sides  can  be  highly  effective  with  real  data,  both 
when  a  theoretical  model  is  available  and,  as  Snee  shows,  when  f(x,/J)  is 
obtained  empirically. 

By  estimating  A,  o.  and  p  simultaneously,  rather  than  simply 
fitting  the  original  response  y  to  f(x,/J),  we  achieve  two  purposes. 
First,  p  is  estimated  efficiently  and  therefore  we  obtain  an  efficient 
estimate  of  the  conditional  median  of  y.  Second,  we  model  the  entire 
condtional  distribution  of  y  given  x,  and,  in  particular,  we  have  a 
model  which  can  account  for  the  skewness  and  heteroscedasticity  in  the 
data.  Carroll  and  Ruppert  (1984)  discuss  the  importance  of  modeling  the 
conditional  distribution  of  y  in  a  special  case,  a  spawner-recruit 
analysis  of  the  Atlantic  menhaden  population.  In  section  6  we  discuss 


the  estimation  of  the  conditional  mean  and  conditional  quantiles  of  y 
given  x. 
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Many  data  sets  we  have  examined  have  had  severe  outliers  in  the 

untransformed  response  y,  but  not  in  the  residuals  e  (/).A)  =  [y*A* 

f^A'(x,/5)];  the  transformation  has  accommodated,  or  explained,  the 
outlying  y's.  There  is  still  the  danger,  however,  that  a  few  outliers 
in  y  can  greatly  affect  /3  and  A.  Outliers  should  not  be  automatical  ly 
deleted  or  downweighted,  especially  when  they  appear  to  be  part  of  the 
normal  variation  in  the  response,  but  it  should  be  standard  practice  to 
detect  and  scrutinize  influential  cases. 

When  influential  cases  are  present  and  they  have  an  unacceptable 
effect  on  the  MLE ,  then  the  best  remedy  is  not  simply  to  delete  these 
cases  but  rather  to  apply  a  robust  estimator.  There  are  several  reasons 
why  this  is  so.  First,  as  Hampel  (1985)  illustrates,  robust  estimates 
are  generally  somewhat  more  efficient  than  outlier  rejection  along  with 
a  classical  estimator  such  as  the  MLE.  Moreover,  outlier  rejection 
affects  the  sampling  distribution  of  classical  methods  in  ways  that  have 
not  been  fully  studied.  In  contrast,  the  large-sample  distribution  of 
most  robust  estimators,  in  particular  M-estimators  which  will  be  used 
here,  can  be  easily  calculated. 

In  this  paper  we  propose  a  case-deletion  diagnostic  and  a 
"bounded-influence"  estimator. 

Case  deletion  diagnostics  for  linear  regression  are  discussed  in 
Belsley.  Kuh,  and  Welsch  (1980)  and  Cook  and  Weisberg  (1982),  and  have 
been  extended  to  the  response  transformation  model  (2)  by  Cook  and  Wang 
(1983)  and  Atkinson  (1986).  The  last  two  papers  approximate  the  change 
in  A  as  single  cases  or  subsets  of  cases  are  deleted. 


Another  approach  to  influence  diagnostics  is  measuring  changes  in 
the  statistical  analysis  under  infinitesimal  perturbations  in  the  model. 
Cook  (1986)  gives  an  introduction  to  this  theory,  which  he  calls  "local 
influence".  We  will  not  consider  local  influence  for  transformation 
models,  but  this  seems  a  promising  area  for  research. 

When  the  response  transformation  model  is  used,  p  depends  heavily 
on  A,  and  it  seems  better  to  examine  influence  for  p  only  after  A  has 
been  determined.  In  constrast,  under  the  "transform  both  sides"  model  p 
and  A  are  only  weakly  related,  with  p  determined  by  the  median  of  the 
untransformed  response,  and  A  determined  by  the  skewness  and 
heteroscedasticity .  Therefore,  influence  for  p  and  A  can  be  treating 
simultaneously . 

For  the  "transform  both  sides"  model  we  propose  two  approximations 

T  T 

to  the  changes  in  9  =  (p  , A)  as  cases  are  deleted.  As  shown  in  the 
next  section,  the  MLE  can  be  found  as  the  least-squares  estimate  of  a 
certain  "pseudo-model".  Both  approximations  start  by  linearizing  this 
pseudo-model  around  the  full-data  estimate,  and  for  each  case  take  one 
step  of  an  iterative  procedure  for  finding  the  estimate  without  that 
case.  The  first  approximation  takes  one  step  of  the  Newton-Raphson 
procedure,  in  effect  using  an  accurate  approximation  to  the  Hessian 
matrix  of  the  sum  of  squares  for  the  pseudo-model.  The  second 
approximation  is  based  on  Atkinson’s  "quick  estimate",  and  is  equivalent 
to  using  one  step  of  the  Gauss-Newton  rather  than  the  Newton-Raphson 
algorithm.  It  is  considerably  less  accurate  than  the  first,  but  is  more 
easily  implemented  on  standard  software  and  is  useful  for  diagnostic 


purposes . 


Subset  deletion  is  simple  in  theory  but  can  be  unwieldly  in 
practice  because  of  the  large  number  of  possible  subsets.  If 
influential  subsets  are  to  be  detected,  some  strategy  is  needed  to 
search  for  them.  An  alternative  to  subset  deletion  and  a  good 
supplement  to  single  case  deletion  diagnostics  is  to  compare  the  fit  of 
a  highly  robust  estimator  with  that  of  the  MLE .  In  the  example  of 
section  6,  a  robust  estimator  reveals  an  observation  whose  influence  was 
masked  by  another  observation  and  could  not  be  detected  by  single 
case-deletion  diagnostics. 

Bounded-influence  estimators  place  a  bound  on  the  influence 
function  of  each  observation.  Bounded-influence  regression  estimators 
have  been  proposed  by  Krasker  (1980),  Hampel  (1978),  and  Krasker  and 
Welsch  (1982).  The  last  paper  and  Hampel  et  al.  (1986)  provide  a  good 
overview. 

Huber  (1983)  has  questioned  the  need  for  bounded-influence 
estimators.  They  appear  to  be  based  on  the  pessimistic  philosophy  that 
nature  will  place  response  outliers  precisely  where  they  can  do  the  most 
harm,  on  the  high  leverage  points. 

We  disagree  with  Huber  and  feel  that  such  pessimism  is  justified. 
Apparent  response  outliers  are  often  due  to  gross  errors  in  the 
measurement  or  recording  of  the  explanatory  variables  x.  In  addition, 
response  outliers  can  be  caused  by  model  breakdown.  Both  an  incorrect 
model  for  the  median  response  or  an  incorrect  specification  of  the 
variance  function,  say  a  constant  variance  model  where  the  variance 
actually  depends  on  the  median  or  mean,  will  lead  to  response  outliers. 
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In  the  latter  case,  the  y  aay  be  outlying  only  relative  to  the  assumed 
variance,  not  the  true  variance,  but  this  is  enough  to  cause  problems. 
For  all  these  reasons,  outlying  y  values  are  more  likely  at  unusual  x 
values.  Huber's  objection  to  bounded-influence  estimation  follows 
logically  from  his  model  that  the  errors  are  identically  distributed 
with  a  heavy-tailed  density,  but  this  model  is  unrealistic  in  many 
modeling  situations. 

Moreover,  even  if  one  accepts  Huber's  conclusions  about  regression 
modeling,  they  apply  only  when  there  is  no  need  to  estimate  a 
transformation  parameter.  The  analog  of  leverage  for  A  is  the 
derivative  of  the  residual  e.(fi)  with  respect  to  A .  A  response  outlier 
will  usually  make  this  derivative  large.  Therefore,  response  outliers 
can  induce  high  leverage  points  in  transformation  models  even  at  x’s 
that  are  in  no  way  unusual. 

Carroll  and  Ruppert  (1985)  proposed  a  bounded-influence 
transformation  (BIT)  estimator  extending  the  Krasker-Welsch  estimator  to 
the  response  transformation  model  (2).  In  this  paper  we  adapt  this 
estimator  to  the  "transform  both  sides"  model.  We  also  discuss 
computational  aspects  and  propose  a  simple  nne-step  estimator  that  can 
be  implemented  on  standard  software  packages  such  as  SAS. 


2.  Weighted  Maximum  Likelihood  Es tlmatlon 


All  estimators  used  in  this  paper  are  found  by  maximizing  a 
weighted  log- 1 ikel ihood .  When  the  weights  are  identically  1,  then  the 


'  A-*  ■  V-  V  r  > 


estimator  is  the  MLE.  The  robust  estimators  introduced  in  section  4  are 


weighted  MLE ' s  with  the  weights  less  than  1  for  influential  cases  and 

equal  to  1  otherwise.  Let  w . w.,  be  fixed  weights.  For  now,  they 

1  N 

T  T 

will  depend  on  the  y's  but  not  the  parameter  9  =  (/}  ,  A }  In  section  4 
the  weights  will  depend  on  9,  but  9  will  be  fixed  at  a  preliminary 
estimate  9 

P 

Throughout  this  paper  y^ ^  is  the  modified  power  transformation 
family  used  by  Box  and  Cox  (1964); 

=  (yA  -  1  )  /A  if  A  *  0, 

=  log(y)  if  A  =  0. 

Our  analysis  will  be  conditional  on  the  observed  x's.  This  is 

T  T 

appropriate  both  for  fixed  and  random  x's.  Let  9  =  {p  ,A )  be  the 
vector  of  the  transformation  and  regression  parameters.  and  let 
g.(y.,fl.o)  be  the  conditional  density  of  y.  given  x..  The 

l  i  l  i 

log- 1  ike  1 ihood  for  y.  is 

«  .  ( y  A «  ,o  )  =  log  g  A  (  y  i .  0  .a)  = 

-  (1/2)  log(  2ko  2  )  *  (A  -  1 )  1  og  ( y  j )  *  (2a2)  1  [e.(0)]2. 


where 


e  .  (9  ) 

l 


U  ) 


f  (M(x../3) 


'  * 


The  weighted  log-likelihood  for  y . yM  is 


L(8  ,a )  =  r  1  =  1  w i  <  .(yre  .0} 


For  fixed  8 


o2(9)  =  (Z1;1  w.  (e.(8))2}  /  {z*ml  w.} 


maximizes  L(8,o)  over  8.  Let  y  be  the  weighted  geometric  mean  of 

y . .  y„  def  ined  by 

1  N 


log(y)  =  {I1^1  *ii  log(y.)}  /  <*£  w.  } . 


The  weighted  MLE  of  8  maximizes 


L  (8)  •  L(8  ,a (8  ) )  = 
max 


(4)  ZS  w.  {-  (l/2)log(2 no2)  +  (A  -  l)log(y.)  -  1/2} 


(1/2)  X  *  w  1  {  log[Z i^1  w.  ( e .  (8 )  /  yA  1)2]  +  1} 


Since  the  weights  w.  are  fixed,  8  minimizes 


N  *  A  2 

SS (8 )  =  Z  .  .  w.  (e  .  (8 )/y  ] 


f'age 
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Following  Box  and  Cox  (1964).  8  can  be  computed  as  follows.  For  fixed 
A,  minimize  SS(8)  in  /J  by  ordinary  (typically  nonlinear)  least-squares 
and  call  the  miniaizer  /5(A),  Plot  L  (/5(A), A)  on  a  grid  and  maximize 

M3  X 

graphically  or  numerically.  This  technique  is  particularly  attractive 

T 

when  f  is  not  transformed  and  f(x,/5)  =  x  /5  for  then  (5)  can  be  minimized 
in  /5  by  linear  least-squares.  When  transforming  both  sides,  this 
technique  is  less  attractive  computationally  but  for  the  unweighted  MLE 
it  does  give  the  confidence  interval 

(A:  L  (/5(A), A)  >  L  (/5(A). A)  (12)  K  f  (  1  a)}. 

H3 x  max  l 

2 

where  Xj(l  a)  Is  the  (1  a)  quantile  of  the  chi  square  distribution 
with  one  degree  of  freedom  Minimizing  (5)  simultaneously  in  a  and  8  is 
straightforward  with  standard  nonlinear  regression  software  One  first 
creates  a  dummy  variable  which  is  0  for  all  cases.  Define 

•A  (  a  )  (A)  ,  -A 

z.(8)  e  .  ( 8  )  y  -  [y  f  lx.,/5)]  y 

One  fits  the  pseudo-model " 

(6)  D.  =  Z.  (8 ) , 

l  l 

iwith  "response'’  D.,  "regression  function"  z.(8).  "regression  parameter 


8,  and  "independent  variable"  (y.,x  ).  The  dummy  variable  D  is  created 


age 


because  nonlinear  least  squares  software  typically  does  not  allow  the 
response  to  depend  on  the  parameters.  In  (6)  the  real  response  is 
incorporated  into  the  model  so  that  is  can  be  transfor»ed  by  A. 

The  standard  error  of  A  that  is  output  when  fitting  (6)  with  a 
least -squares  package  should  not  be  used.  It  is  not  that  sane  as  the 
standard  error  fro*  the  inverse  Fisher  information  and  it  does  not 
consistently  estimate  the  large-sample  standard  deviation  of  A;  see 
section  5. 


3.  Diagnostics 


A  simple  and  easily  interpreted  way  of  measuring  the  influence  of 
the  ith  case  is  to  recompute  9  with  this  case  deleted.  Since  nonlinear 
estimation  can  be  computer  intensive,  we  will  describe  two  simple 
approximations  to  these  case-deletion  diagnostics. 


Let  9  and  9^  be  the  MLE  of  9  with  and  without  the  ith  case,  and 
E  g  E 

let  d.  -  (d  /J.  ,  d  A.)  =9  -  he  the  exact  change  in  9  upon  deletion 

2 

of  this  case.  Let  vz^(9)  and  v  z.(9)  be  the  gradient  and  Hessian  of 

E 

z.(9)  The  first  step  in  the  approximation  to  d.  will  be  to  ignore  the 
change  in  y  when  yj  is  deleted.  From  the  previous  section.  9  is  the 
solution  to 


Z  ^  ,  Z  .  (9  )  vz  .  (9  ) 


An  approximate  solution  to  (7)  is  obtained  by  taking  one  step  of  the 
Newton-Raphson  algorithm  beginning  at  0.  This  requires  the  differential 
of  the  left-hand  side  of  (7)  which  is 


(8) 


+  Zj(»)  V  2Zj  (0  )  } 


Then  the  one-step  approximation  to  0  is 


(i) 


,-l 


0(1,  =0  f-H(1)(«n  H(i)(0) 


0  *  [*H(I)(0)1  (z^e)  V*  (•  ) )  .  since 


Zj  =  i  2j(0)  7Z,|8)  =  0. 


To  simplify  vH^(0)  we  first  note  that  by  the  law  of  large  numbers 


ya  exp{E[Zi  =  1log(yi )/N  |  . xNJ  >  =  v .  say. 


If  y  is  replaced  by  p  in  z^(0).  then  0/  dp  z.(0)  does  not  depend  on 


y . y...  Thus,  only  the  lower  right-hand  element  of  v  z.(0)  depends 

IN  j 


on  y  ^  ,  so  that  2^(0)  is  independent  of  all  other  entries  of  v  z^(0). 


Therefore,  letting  0Q  be  the  true  parameter  we  have 


E(z.(0Q)  v2z  j  (©q)  ]  a  diag{0 . 0,  E[z.(0Q)  (d2/  <5A2  z  ^  (e  Q )  )  ]  > 


s i nee  Ez  ( 6^ )  -  0 . 


Therefore,  letting 


(9)  B j  =  £j#1  {  ?Zj(8)  vTzj(fi)  +  where 

D  =  D  (8)  =  diag{0 . 0.  Zj(8)  (02/0  A2  z^e))} 

we  have  vH  ^  ^ (8 )  a  Bj . 

The  approximate  influence  diagnostic  for  the  ith  case  is 

^\)  =  -b'1  vZj(8)  Zj(»)  a  •  "  «{1)- 

The  inverse  of  B  can  be  easily  computed  using  the  well-known 
J 

identity  [Rao  (1973.  page  33)] 

T  -1  -1  -1  T  -1  T  -1 

(10)  (A  -  uv  )  =  A  +  A  uv  A  /(I  -  v  A  u), 

which  holds  for  nonsingular  p  x  p  matrices  A  and  p-dimensional  vectors  u 
T  - 1  T 

and  v  such  that  v  A  u  *  1  (so  that  A  -  uv  is  nonsingular). 

Let 

C  =  X  ^  (vZj(8)  vTZj(0)  +  z  j  (8  )  v2Zj(0)}. 

Cj  =  C  -  vz  j(8  )  vV(8  )  . 

2  *  1/2 

u.  =  (0 . 0.  |  z  (8)  v  z  (8)|  ).  and 


Using  (10)  twice  we  have 


T  T 

Then  D  .  =  u  .v  .  so  that  B.  =  C.  -  u  .v  . . 

J  J  J  J  j  J  J 

(11)  Cj"1  =  c"1  +  c_1vZj((9  )vTZj(e)c_1/(i  -  vTZj(e)c_1vZj(«)) 
and  then 

-1  -1  -i  T  -1  T  -1 

(12)  B.  =  C.  +  C.  u.v.C.  /(l-ujc,  v,). 

J  J  jjjj  jjj 

Using  (11)  and  (12)  allows  us  to  compute  B„  1 . B*  using  only  the 

single  matrix  inversion  needed  to  calculate  C  1.  If  we  ignore  the  Dj(e) 
in  (9).  then  we  obtain  an  even  simpler,  but  less  accurate,  approximation 

4?  =  (^Q/J.,  Aj)  =  -  [Z  ,vZj(9)vTZj(0)]_1Zj(9)vZj(e). 

is  analogous  to  the  "quick-estimate"  diagnostic  used  by  Atkinson 
(1986)  for  the  response  transformation  model.  The  advantage  of  is 
ithat  it  can  be  computed  using  standard  software  packages  that  calculate 
linear  regression  diagnostics.  To  do  this,  one  creates  a  lineariZed 


model  with 


as  the  vector  of  dependent  variables  and 

(TZjfS  ) . VZN(fl  >  )T 

as  the  design  matrix.  Then  A®,  i  =  1 . N,  are  the  diagnostics  DFBETA 

of  Belsley,  Kuh,  and  Welsch  (1980,  page  13).  Some  software  packages 
compute  a  scaled  version  DFBETAS,  which  is  equally  useful  for 

diagnostics.  Cook's  D  or  DFFITS  from  the  linearized  model  can  be  used 
as  measures  of  total  influence  for  p  and  A. 

If  we  used  one  step  of  the  Gauss-Newton,  rather  than  the 

Newton-Raphson .  algorithm  when  solving  (7)  then  we  would  obtain^,  not 

A 

A The  difference  between  the  Gauss-Newton  and  Newton-Raphson 

algorithms  is  that  the  former  uses  an  approximate  Hessian  which  in  the 

N  2 

present  notation  consists  of  ignoring  the  termXj=1  Zj(«)v  zj(®)  in  the 
Hessian  of  the  sum  of  squares.  For  regression  models  without  a 
transformation  parameter,  the  residuals  are  uncorrelated  with  their 
Hessians  and  the  Gauss-Newton  approximation  is  acceptable 

In  the  example  of  section  6  and  in  all  other  examples  that  we  have 

0  E  A 

examined,  |A^|  substantially  overestimates  large  values  of  |a.|  while  A. 

g 

is  a  good  approximation  to  A  .  The  overestimation  results  from  positive 

2  2 

correlation  between  z  ^(0)  and  O  /<3A  z  ^(9  )  causing 

Xj.N1  Z  (8  )  [<?2/<5A2  Zj(8  )  ] 

N  ~  2 

to  be  positive  and  of  the  same  magnitude  as  [O 'OK  z^#)] 
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0  E 

If  we  use  A  j  only  as  a  diagnostic,  then  the  overestimation  of  |^±A  ^  | 

is  not  a  serious  proble*  and  it  does  not  prevent  us  fro*  detecting 

influential  cases. 

E 

The  diagnostic  or  its  approximations  are  vectors  showing 

influence  separately  for  each  parameter.  An  overall  measure  of 
influence  of  the  ith  case  is  the  exact  likelihood  distance  defined  by 
Cook  and  Weisberg  (1982,  section  5.2)  as 


(.31  id' -  a(uMX(.‘(  -  lmi(.‘(1))]  .  Wi>T<’\ax<*»4 


The  approximation  in  (13)  Is  also  found  in  Cook  and  Weisberg  (1982)  and 

follows  from  a  Taylor  expansion  using  vL  (0 )  =  0.  We  can  define  an 

I3X 

A  E  A 

accurate  approximation,  LD.  by  replacing  4.  in  (13)  withd^.  As  a  quick 


approximation  one  can  use 


0  T  N  T  0 

LD,  =  (4^1  [I  „vz,(0)v  z.(0)))al7. 
i  l  1=1  x  l  1 


which  is  a  constant  multiple  of  Cook's  1)  from  the  linearized  model. 

From  the  above  discussion  we  can  expect  that  LD^  will  not  be  an  accurate 

E 

approximation  to  LD. 


4.  Robust  Estimation 


Once  influential  cases  have  been  identified,  we  must  decide  how 
they  should  he  treated  In  some  situations,  the  statistician  will  feel 


that  they  are  valid  data,  that  they  do  not  indicate  model  deficiencies, 
and  that  they  should  be  allowed  full  influence  on  the  analysis.  Then 
the  ML£  can  be  used. 

In  other  situations,  the  influential  cases  will  be  suspected  as 
gross  errors.  Alternatively,  the  influential  cases  may  indicate  a  model 
deficiency,  but  either  the  sparsity  of  data  near  their  x  values  will  not 
allow  a  better  model  to  be  developed  or  the  data  analyst  will  hesitate 
to  add  complexity  to  the  model  merely  to  accommodate  one  or  at  most  a 
few  observations.  Then  a  robust  estimator  should  be  used. 

This  section  is  concerned  with  the  robust  estimation  of  6.  The 
scale  parameter  o  can  be  estimated  separately  with  a  robust  scale 
functional,  e.  g.  the  median  absolute  deviation  (MAD)  applied  to  the 
residuals  from  a  robust  fit.  Let  s^ty^®)  =  v<  ^  ( y  .  ,e  ,o  (9  ) )  be  the  score 
function,  i.  e.  the  gradient  of  the  log-likelihood  for  y_  The  weighted 
MLE  satisfies 

ri=i  "i  Vyrs>  =  0 

The  ordinary  (unweighted)  MLE  is  sensitive  to  cases  with  large  values  of 
Sj,  in  particular,  to  cases  with  large  values  of  residual  e^t/J.A),  O/Of. J 

(fM,(x.,/J)).  or  O/OK  s  ( y  ,  ®  ) .  corresponding  to  response,  high  leverage 
points,  and  points  having  high  influence  for  A.  respectively. 

A  robust  bounded- inf luence  estimator  can  be  found  by  letting  w^ 
decrease  as  some  norm  of  s  ^  ( y  .0  )  increases.  To  do  this  the  weights 


■ust  be  allowed  to  depend  on  9.  Thus  we  define  the  estimator  9  as  the 
solution  to 

(14)  w1(y1,S)  s1(yi  ,9,0(9))  =  0, 

where  w.  is  a  suitable  scalar  weighting  function  and,  as  in  section  2, 
"2  ' 

o  (8)  is  the  weighted  variance  of  the  residuals. 

Let  ♦^(y,8)  =  w^ (y ,9 ) s  (y ,9  ) .  The  minimum  requirement  for 

robustness  is  to  have  ♦  (y.fl)  bounded  as  a  function  of  i ,  y  ,  and  9. 
Otherwise,  a  single  outlier  can  cause  an  arbitrarily  large  change  in  9. 

It  should  be  mentioned  that  a  bounded- inf luence  estimators  may  not 
have  a  high  breakdown  point.  The  breakdown  point  is  the  largest 
percentage  of  contamination  that  an  estimator  can  tolerate  before  it  can 
be  overwhelmed  by  the  contaminants .  This  means  that  if  the  percentage 
of  contaminants  exceeds  the  breakdown  point,  then  the  estimate  can  be 
forced  to  take  an  arbitrary  value  by  choosing  the  contaminants  in  a 
sufficiently  nasty  way.  The  estimators  that  we  define  here  are  related 
to  the  Krasker-Welsch  regression  estimator,  and  like  that  estimator  they 
will  have  a  breakdown  point  at  most  (p+1)  where  (p+1)  is  the  total 
number  of  parameters. 

In  the  case  of  the  linear  model,  Rousseeuw  (1984)  and  Rousseeuw  and 
Yohai  (1984)  have  proposed  estimators  with  near  50%  breakdown  points, 
the  best  possible.  However,  these  estimators  have  poor  efficiency. 
Yohai  (1985)  shows  how  to  achieve  both  high  asymptotic  efficiency  and  a 
near  50%  breakdown  point,  but  again  only  in  the  case  of  linear 
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regression.  In  the  future,  we  hope  to  develop  similar  estimators  for 
transformation  models.  A  major  difficulty  will  be  computational 
complexity . 

When  choosing  w,  asymptotic  efficiency  measured  by  the  covariance 
matrix  of  9  must  be  balanced  against  robustness  measured  by  the  supremum 
of  some  norm  of  *  (y,0),  the  so-called  gross-error  sensitivity.  For 
univariate  parameters  there  is  a  unique  optimal  w^  which  minimizes  the 
asymptotic  variance  subject  to  a  given  bound  on  the  gross-error 
sensitivity  (Hampel  1968,  1974).  For  multivariate  parameters  such  as  9 
the  balancing  of  robustness  and  variance  raises  philosophical  questions, 
since  there  are  many  ways  of  comparing  covariance  matrices  or  of  norming 
vector  functions.  Different  norms  on  ♦ ^ ( y ^ , 0 )  give  rise  to  different 
definitions  of  gross-error  sensitivity. 

The  approach  we  take  generalizes  the  Krasker-Welsch  (1982) 
bounded- inf luence  regression  estimates.  Whether  the  Krasker-Welsch 
estimator  optimizes  the  asymptotic  covariance  matrix  in  any  meaningful 
sense  is  an  open  question,  but  its  efficiency  at  the  normal  model  is 
usually  close  to  that  of  the  MLE  (Ruppert  1985).  Krasker  and  Welsch 
(1982)  bound  the  so-called  self -standardized  gross-error  sensitivity. 


which  we  denote 

by  1 2-  We 

will 

describe  i ^ 

only 

briefly  and 

the 

interested  reader 

is  referred 

to 

the  original 

paper 

of  Krasker 

and 

Welsch  or  to  Hampel  et  al.  (1986,  chapter  4)  for  further  details. 

First  we  note  that  the  influence  function  of  9  satisfying  (14)  is 


IF^y..^)  =  B  V.(y.,0),  where 
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-IN  T 

B  =  -  N  ZI  =  1  7  E[#  j  (  y  .  .®  ) 


This  definition  of  the  influence  function  is  conditional  on  x  . x 

1  N 

but  coincides  with  the  usual  definiton  when  the  x's  are  independent  and 

identically  distributed  and  summation  over  i  is  replaced  by  expectation 

with  respect  to  the  x  s.  The  definition  of  B  is  analogous  to  that  on 

page  5  of  Carroll  and  Ruppert  (1985)  where  w  is  incorrectly  squared. 

*  -1  -1  T 

The  asymptotic  covariance  matrix  of  9  is  B  A(B  )  where 


A  =  N~1Z1^1E(#i(yi.<n*j(y..e)]. 


An  intuitively  reasonable  way  to  norm  influence  function  is  to  use  the 
asymptotic  covariance  matrix  of  the  estimator.  See  Krasker  and  Welsch 
(1982)  for  further  motivation  and  discussion.  The  resultant  measure  of 
influence  is  the  so-called  self-standardized  gross-error  sensitivity 
defined  as 


“»g  *  max  ll  I F  ( y  ^  ,0  )  lly  =  max  ll  Wj(y  ,0)  s^y  ,0)  II  , 
i  i  1 


T  -1  1/2 

where  ll  v  ll  =  (v  M  v)  for  any  vector  v  and  positive  definite  matrix 

M.  This  definition  of  ^  is  analogous  to  equation  (15)  of  Carroll  and 

2 

Ruppert  (1985)  where  w  has  been  incorrectly  omitted  from  the  last  term. 

To  robustify  the  MLE,  we  will  choose  the  weights  so  that  t  does 

s 

1/2 

not  exceed  a  predetermined  bound,  t  must  be  at  least  (p*l)  (Krasker 


and  Welsch  1982).  Proa  experience  with  this  and  other  problems  we 

\  /  2, 

suggest  bounding  i  by  a(p+l)  '  ^  where  "a"  is  between  1.1  and  1.6.  The 
s 

choice  a  =  1.5  worked  well  on  the  data  set  in  section  6. 

If  an  observation  has  low  influence  then  it  should  not  be 
downweighted.  Otherwise,  it  should  be  downweighted  just  enough  to  keep 
■»  below  the  given  bound.  Therefore  the  weights  should  be 

1/2 

(15)  “My,.*)  =  nin  {1.  a(*  1)  /lls  (y  .9  )ll  } . 

11  1  1  A 

In  (15)  A  must  be  replaced  by  an  estimate  A. 

To  calculate  9  we  used  a  simple  iterative  scheme: 


(1)  Fix  a  >  1.  Let  C  be  the  total  number  of  iterations  that  will  be 

used.  Set  c  =  l.  Let  9^  be  a  preliminary  estimate,  possibly  the  MLE . 

Set  w.  =1  for  all  i . 
l 


(2)  Define 


a  =  si<yrVsI(VV- 


(3)  Using  (15)  update  the  weights: 


w,  =  min(l,  a  (  p*  1 ) 1  2/ll  s  .  (  y  .  , e  II*} 
1  l  1  p  A 


(4)  Using  the  methods  of  section  2.  find  the  weighted  MLE  with  these 
weights,  and  call  it  9. 


(5)  If  c  <  C.  set  9  =  9.  c  ~  c  »  1,  and  return  to  step  (2). 

P 

Otherwise,  stop. 


It  is  possible  to  implement  this  algorithm  on  standard  software 


packages,  in  particular  SAS .  Steps  (2)  and  (3)  can  be  computed  with  a 
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matrix  language  such  as  PROC  MATRIX  on  the  1982  version  of  SAS.  Step 
(4)  can  be  performed  using  a  weighted  nonlinear  least-squares  routine 
such  as  PROC  NL  IN  on  SAS.  By  using  macros  on  SAS  or  a  similar  package, 
it  is  possible  to  put  the  matrix  computations  and  the  least-squares 
routines  into  an  iterative  loop.  We  Initially  used  SAS,  but  now  prefer 
the  matrix  programming  language  GAUSS. 

One  or  two  iterations  seem  adequate  for  diagnostic  purposes,  but 
this  algorithm  sometimes  converges  slowly,  particularly  when  there  are 
extremely  influential  points.  This  was  the  case  with  the  example  in 
section  6  where  the  algorithm  did  not  stabilize  until  ten  iterations. 
Unfortunately.  the  slow  convergence  gave  the  appearance  that  the 
algorithm  had  converged  after  only  two  or  three  iterations. 

We  found  that  a  fully  iterative  version  of  the  algorithm  could  be 
easily  implemented  with  the  GAUSS  on  an  IBM  PC -AT ,  and  computation  time 
for  ten  or  more  iterations  was  acceptable. 

Although  bounded- inf luence  estimation  limits  the  effect  of  any  case 
on  the  estimate,  ail  cases  regardless  of  how  deviant  from  the  bulk  of 
the  data  will  have  some  influence.  Hampel  el  al  .  (1986)  discuss  the 

need  for  robust  estimators  that  completely  reject  extreme  outliers.  For 
estimation  of  a  location  parameter,  this  can  be  done  with  a  redescending 
ps i  f unc t i on 

We  can  define  an  analog  to  a  redescending  psi  function  here.  Let 
*(x)  be  an  odd  function  with  * ( x )  £  0  for  \£  0.  and  define  the  weights 

(  16  )  wl  y  .  .9  )  *(H  s  ,  (y  .  .9  >M  ‘  )  II  s  .  (  y  .9  )ll  ' 

l  1  1  A  li  A 


Equation  (16)  reduces  to  (15)  if  ?  is  the  Huber  psi-function 


*(x)  =  x 


Ixl  5  b 


=  b  sign(x)  otherwise. 


with  b  =  a(p+l) 

If  for  some  R  >  0.  *(x)  =  0  for  ail  |x|  >  R,  then  extreme  outliers 
are  completely  rejected.  In  the  example  we  use  Hampel's  three-part 
redescending  psi-function 


0  i  x  £  b. 


bl(b3  ‘  X)/(b3 


bl  S  x  5  b2 

b2  S  x  *  b3 
b3  5  x, 


with  (b  .  b2,  b3)  =  (p-1)  (1.5,  3.5.  8.0). 

In  section  6,  the  redescending  estimate  was  computed  by  the  same 

algorithm  used  to  calculate  the  bounded- inf luence  estimate.  In  step  1, 

w,  and  0  were  from  the  last  iteration  of  the  bounded -inf luence 
i  P 

estimate.  In  step  (3)  the  weights  were  calculated  using  (16)  and  (17) 
instead  of  ( 15 ) . 


“  J  ' «\V„'  •%  ‘  -  ' 


5.  Estimating  The  Covariance  Matrix  of  8 


The  asymptotic  covariance  matrix  of  the  MLE  can  be  found  by  (i) 
inverting  the  Fisher  information  matrix  or  (ii)  using  the  covariance 
matrix  of  the  influence  function,  the  covariance  being  with  respect  to 

the  empirical  distribution  of  (y,.x.),  i=l . N.  Method  (ii)  is  based 

on  the  asymptotic  theory  of  M-est imat ion ;  see  for  example  Hampel  et  al. 
(1986.  section  4.2c).  One  advantage  of  (ii)  is  that  the  asymptotic 
covariance  is  consistently  estimated  even  if  the  t.  are  not  normally 
distributed  (Huber  1967)  or  do  not  have  a  constant  variance.  In  fact, 
method  (ii)  is  similar  to  the  jackknife  which  Wu  (1986)  advocates  as  a 
consistent  estimate  of  the  leas t - squares  covariance  matrix  under 
heteroscedast ici ty .  Moreover.  method  (ii)  can  be  used  for  the 
bounded- inf luence  and  redescending  estimates  as  well. 

Method  ( 1 ) :  The  observed  Fisher  information  matrix  for  (8,o)  is 

(  18  )  -  v2L (0.0). 

2 

where  the  weights  in  (3)  are  all  unity  and  v  means  the  Hessian  with 

respect  to  (8  ,o  )  We  could  invert  (18)  and  then  take  the  upper  left 
2 

(p  *  1)  corner  corresponding  to  9.  but  by  Patefield  11977,  1986)  this 

2 

is  equivalent  to  the  easier  computation  of  inverting  the  (p-1)  matrix 

v2[,  (8  )  . 

max 

2 

where  v  now  is  the  Hessian  with  respect  to  9  only 
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Method  (111 


The  MLE ,  bounded  influence  estimator,  and  the 


redescending  estimator  are  all  defined  by  the  equation 


ri*,  W»'  VV”  '  riN,  *  1 1  y i  ® 1  '  0 


and  only  differ  in  the  choice  of  the  weights.  The  asymptotic  covariance 

matrix  of  any  estimator  solving  an  equation  of  the  form  (19)  can  be 

'  - 1  *  *  -  i  t 

estimated  by  [B  A  (B  )  ]  where 


B  =  vT  X  9 . (y . ,0  )  and 
i  =  l  l  l 


4  ‘  VV‘"‘I<Vr‘” 


The  asymptotic  theory  of  M-estimat ion  also  shows  that  the  standard 
error  A  when  fitting  the  pseudo-model’'  (see  section  2)  are  incorrect 
The  pseudo  model  finds  the  MLE  by  minimizing 


N  2 

T  ,  ,  z .  <  y  ,9  ) 
i  =  1  l  l 


or  solving 


Zl'=l  Zi(yi  9)  =  ° 


When  the  pseudo  model  is  fit  by  a  nonlinear  least  squares  package  the 
estimated  covariance  matrix  is 


1  * » ■  .*•  • .  ■  *.  *  -  \  . 1  ~  _  *  >'*■ 
m  •  .  .  .  •  w"  >•  —  ^  _  —  •  •  jt  ^  ^  0  ~ 


(  21  ) 


a  (0  )  (I  .  vz  .  (y  .  ,9  )  v  ^y  .  ,$  ) } 

The  asymptotic  covariance  of  the  solution  to  (20)  is  consistently 
estimated  by 

8LslAlS,eLs'|T 

-here  ^t^l  W*’  Vzi<V9>}  and  ALS  = 

vz^ty^.0)  r^z.ly.  S)  It  is  not  hard  to  see  that  (21)  and  (22) 
have  different  limits  so  that  (21)  is  inconsistent.  In  practice  (21) 
and  (221  can  be  considerably  different,  especially  in  the  estimated 
variance  of  a 

6.  An  Examp 1 e  Prom  Fisheries  Analysis 

'n  this  section  we  look  at  an  example  Our  goals  are  (a)  to  see 
the  type  of  data  analytic  information  that  can  come  from  the  diagnostics 
and  f h e  robust  estimators,  (b)  to  see  how  well  the  robust  estimators 
ihandle  outlying  data  and  (c)  to  compare  the  accuracy  of  the  accurate'' 
approximation  d'  to  the  quirk  approximation 

When  managing  a  fish  stock  one  must  model  the  relationship  between 
f  he  annual  spawning  stock  size  and  the  eventual  production  of  new 
atihobie  sized  fish  '.returns  or  recruitsi  from  the  spawning  Ricker 
and  smitn  ' 197SI  give  numbers  of  spawnei s  (Si  and  returns  1RI  from  1940 


until  1967  for  the  Skeena  River  sockeye  salmon  stock. 

Using  some  simple  assumptions  about  factors  influencing  the 
survival  of  juvenile  fish,  Ricker  (1954)  derived  the  theoretical  model 

R  =  S  exp(/J2S)  =  f  ( S  ,/J ) 

relating  R  and  S.  Other  models  have  been  proposed,  e.  g.  by  Beverton 
and  Holt  (1957).  However,  the  Ricker  model  appears  to  fit  adequately 
and,  in  particular,  gives  almost  the  same  fit  to  this  stock  as  the 
Beverton-Holt  model. 

From  Figure  1,  a  plot  of  R  against  S,  it  is  clear  that  recruitment 

is  highly  variable  and  heteroscedastic ,  with  the  variance  of  R 

increasing  with  its  mean.  Several  cases  appear  somewhat  outlying,  in 

particular  #5,  #18,  #19,  and  #25. 

A 

An  index  plot  of  Dj  was  constructed;  see  Figure  2.  Clearly  case 
#12  stands  out  as  the  most  influential  by  this  measure,  and  #5,  #19,  and 
#25  are  only  moderately  influential  by  comparison.  We  will  examine 
these  cases  more  closely. 

E  A 

The  exact  case-deletion  statistic  A.  and  the  approximations  A ^  and 
are  given  in  table  1  for  these  four  cases. 

Case  #12  has  a  high  influence  on  all  three  parameters.  In 

g 

particular,  AA  ^  =  .51.  showing  that  the  MLE  of  A  decreases  from  .31  to 
-.2  when  #  12  is  deleted.  Why  is  #12  so  influential? 

To  see  why,  look  again  at  Figure  1..  Observation  #12  is  not  far 
removed  from  the  median  relative  to  the  variation  in  all  the  data,  but 


It  is  quite  far  removed  relative  to  the  variation  in  the  other  data  with 
similar  values  of  the  independent  variable  S. 

The  27  data  points  besides  #12  suggest  that  the  data  are  extremely 
heteroscedast ic ,  with  the  variation  in  recruitment  increasing  very 
rapidly  with  the  median  recruitment.  The  effect  of  #12  is  to  increase 
the  apparent  recruit  variation  for  low  median  recruitment  and  to  suggest 
that  the  heteroscedastici ty  is  not  so  nearly  pronounced.  Seen  in  this 
ilight,  the  much  more  severe  transformation  (A  =  -.2  instead  of  A  =  .31) 
when  #12  is  deleted  is  not  surprising. 

Besides  suggesting  less  heteroscedasticity  than  seen  in  the 
remainder  of  the  data,  #12  has  a  large  negative  residual  which  suggests 
less  right  skewness  as  well. 

To  further  analyze  the  influence  of  #12  on  A  we  will  introduce  two 

alternative  estimators  of  A.  These  will  be  discussed  fully  in  a 

forthcoming  paper  by  Aldershof  and  Ruppert.  For  fixed  A,  let  /0(A)  be 

*  -  T  T 

the  MLE  of  p  and  define  0(A)  =  (/3(A)  ,A  )  .  The  skewness  estimator  A 

S  K 

is  the  value  of  A  such  that  the  skewness  coefficient  of  the  residuals 

{e.(9(A))}  is  0.  The  heteroscedasticity  estimator  A.  .  is  the  value  of 
i  het 

A  such  that  the  correlation  between  (e^(0(A))}  and  { log(  f  (x  ,/3(A ) ) }  is 
0. 

When  case  #12  is  omitted  the  value  of  Ag^  only  changes  from  .58  to 

.46.  but  A,  ^  changes  much  further,  from  .16  to  -.86.  The  major  effect 
het 

of  deleting  #12  is  to  increase  the  heteroscedasticity. 

Case  #12  was  the  year  1951  when  a  rock  slide  drastically  reduced 
recruitment  (Ricker  and  Smith  1975).  For  this  reason  we  are  quite 


comfortable  downweighting  It  severely.  In  fact,  It  seems  best  to  reject 
#12  entirely.  This,  in  effect,  is  what  the  redescending  estimator  does, 
see  below. 

Compared  to  case  #12,  cases  #5,  #19.  and  #25  all  have  the  opposite 

effect  on  A.  Deleting  any  of  them  decreases  the  apparent  recruitment 

variance  when  the  number  of  expected  recruits  is  large,  in  effect 

E 

suggesting  less  heteroscedasticity .  For  this  reason  AK  is  positive  for 
i  =5,  19,  and  25. 

The  effect  of  #12  on  p  is  not  as  easily  analyzed  as  its  effect  on 

A.  Deleting  #12  increases  p^  from  3.29  to  3.77  and  decreases  p from 

7.00  to  -9.54.  These  effects  tend  to  cancel  but  not  completely.  As 

shown  in  Figure  3,  the  net  effect  of  including  #12  is  a  decrease  in 

f  ( S  ,/5 )  for  small  S  and  an  increase  for  large  S.  The  former  effect  is 

plausible  since  #12  has  a  low  value  of  S  and  a  negative  residual.  The 

increase  in  f(S,/3)  for  large  S  is  not  so  plausible,  but  it  is  a 

consequence  of  using  the  Ricker  model  for  the  median  recruitment. 

E 

Having  analyzed  the  exact  changes  A  .  ,  we  turn  to  the  accuracy  of 

the  approximations  and  4^.  The  accuracy  of  A®  is  poor  for  cases  #5 

and  #12.  especially  #12,  and  these  are  precisely  the  cases  of  interest. 

E 

The  same  is  true  of  the  approximations  to  LD^ ;  the  quick  approximation 
is  rather  inaccurate. 

The  quick  estimate  does  better  for  cases  #19  and  #25,  but  this  is 
merely  fortuitious.  The  overestimation  by  the  quick  estimate  is  small 
here  and  cancels  the  error  from  not  recalculating  y  after  case-deletion. 
To  see  this  we  can  examine  the  changes  in  the  MLE  induced  by 


rase  deletion  with  y  Is  kept  equal  to  the  geometric  mean  of  all  the  y’s 

These  changes  are  -  17,  .77.  013.  and  Oil  for  i  =  5,  12,  19.  and  25. 

A 

respectively  in  all  four  cases,  the  change  is  closer  to  A  A  ,  than  to 
d^A 

1 

We  calculated  20  iterations  of  the  bounded  influence  estimate  with 
the  tuning  constant  a  =  1  5.  and  using  this  as  a  starting  value  we 

calculated  10  iterations  of  the  redescending  estimate  with  (b^.b^ ,63) 

(  p  -  1 ) 1  2 ( 1 . 5 .  3.5.  80). 

The  bounded  influence  estimate  changed  rapidly  for  the  first  five 
iterations,  more  slowly  for  the  next  five,  and  then  was  stable  for  the 
last  ten  iterations.  To  show  the  behavior  of  the  algorithm,  the  value 
of  9  and  non  unity  values  of  w.  are  given  in  table  2  for  iterations  1. 
2.  3.  4,  5.  10,  and  20.  It  is  interesting  to  examine  w„.  This  is  l  for 
the  first  three  iterations  but  eventually  decreases  to  .65.  less  than 


Case  #4  is  influential,  but  its  influence  is  masked  by  #12.  Robust 
estimation  or  subset  deletion  diagnostics  seem  necessary  to  detect  the 
influence  of  #4. 

The  values  of  0  and  nonunity  values  of  w  are  also  given  in  table  2 

for  the  redescending  estimate  Case  #12  is  completely  rejected,  e.  g  , 

W12  ^°r  iterations.  The  value  of  w^  decreases  slowly  for  three 

iterations  and  then  jumps  downward  to  almost  0  on  the  fourth  iteration. 

With  #4  nearly  deleted  the  weights  w,_.  wiri,  and  w  readjust  on  the 

lb  1  y  25 

fifth  and  sixth  iterations  The  redescending  estimate  is  stable  after 
the  sixth  iteration 
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The  first  iteration  of  the  redescending  estimator  is  nearly 
identical  to  the  MLE  without  #12  After  #4  is  strongly  downweighted,  A 
decreases  to  about  -.3  which  suggests  even  slightly  stronger 
heteroscedast  icity  As  both  #12  and  #4  are  completely  or  nearly 
completely  rejected.  #16  becomes  influential  and  is  slightly 
downwe ighted 

We  do  not  necessarily  advocate  using  the  ful ly- iterated 
redescending  estimator  The  bounded- inf luence  estimator  or  the  first 
iterate  of  the  redescender  give  a  good  fit  to  the  bulk  of  the  data  and 
reject  the  outlier  #12  Without  further  information  about  these  data, 
we  are  not  comfortable  downweighting  #4  and  #16  as  much  as  the 
fully  iterated  redescending  estimator  downweights  them 

If  forced  to  choose  one  estimate,  we  would  choose  the  one  step 
redescender.  The  residuals  {e^{6})  from  this  estimate  are  plotted  in 
Figure  4.  Ignoring  e12(0),  remaining  residuals  show  only  slight 
heteroscedasticity  and  almost  no  skewness. 

The  redescending  estimator  completely  rejects  #12  so  there  is  no 
need  to  refit  with  this  anomalous  case  removed.  However,  it  is 
instructive  to  see  what  happens  if  this  is  done.  Both  the 
bounded- inf luence  and  the  redescending  estimators  converge  rapidly,  with 
the  first  iterate  equal  for  practical  purposes  to  the  fully-iterated 
estimate.  This  suggests  that  the  algorithm  converges  slowly  only  in  the 
presence  of  an  extremely  influential  point  such  as  #12. 

In  table  3  we  give  the  standard  errors  of  the  MLE  by  method  (1), 
inverting  the  observed  Fisher  information  matrix.  We  also  give  the 


method  ( i  i )  standard  errors  for  the  MLE ,  the  one-step  and  iterated 
(c  =  10)  bounded-influence  estimator,  and  the  one-step  and  iterated 
(c  =  5)  redescending  estimator. 

The  method  (i)  and  method  (ii)  standard  errors  of  the  MLE  of  A  are 
moderately  different,  but  both  are  considerably  smaller  than  the 
standard  error  of  A  for  the  robust  estimators.  Downweighting  #12  seems 
to  increase  the  variability  of  A,  but  since  #12  is  known  to  be  an 
unusual  year  it  seems  wise  to  use  a  robust  estimator  despite  the  higher 
variability.  The  standard  errors  of  p^  and  p  are  smaller  for  the 
robust  estimators  than  for  the  MLE. 

At  least  in  this  example,  robust  estimation  appears  to  cause  a  loss 
in  efficiency  of  A  but  a  gain  in  efficiency  of  p.  However,  this  loss  in 
efficiency  for  A  is  only  apparent,  not  real.  There  are  two  ways  of 
looking  at  this,  either  conditioning  or  not  conditioning  on  the  event 
that  a  rock  slide  occurred  in  1951.  Conditional  on  there  being  exactly 
one  rock  slide  and  it  occuring  in  a  year  when  the  number  of  spawners  was 
low,  the  MLE  has  a  small  variance  but  a  large  bias  and  consequently  a 
large  mean  square  error  relative  to  the  robust  estimators.  If  we  do  not 
condition  on  the  occurence  of  exactly  one  slide,  but  rather  admit  that 
some  other  number  of  slides  could  have  occurred  and  that  these  could 
have  occurred  at  any  years,  then  it  is  clear  that  the  MLE  is  really  much 
more  variable  than  its  standard  error  shows 

The  change  in  A  when  #12  is  deleted  is  large  relative  to  the 

E 

standard  errors  of  A.  Notice  that  A  A  is  over  twice  the  standard 
error  of  the  MLE  and  about  1,65  times  the  standard  error  of  the  one-step 


redescending  estimator. 
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Table  1  Deletion  diagnostics  for  the  most  influential  cases.  The 
superscripts  denote  the  method  of  calculation:  E  =  exact,  A  =  accurate 
approximation,  Q  =  quick  approximation. 


DIAGNOSTIC 


*ea. 
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A 


*Efi 


1 ,  i 


„A„ 
A  ft 


1 .  i 


4Q/j 


1  ,  i 


AEp 


2.  i 


AAp 


2,  i 


AQp 


2,  i 


lde 

i 


LD  . 


l 


LDQ 
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CASE  NUMBER 
12  19 


25 


-  .  10 

.51 

-  .034 

.033 

-.14 

.  61 

-.015 

-.015 

-.32 

1.57 

-.036 

-.034 

-.22 

-.48 

.08 

.  12 

-.22 

-.45 

.11 

.  15 

-.34 

-.43 

.  10 

.  14 

1.48 

2.54 

-1 .02 

-1.25 

1.47 

2.57 

-1  . 14 

-1.38 

1  .  80 

3.72 

-1  . 12 

-1.36 

.  58 

9.64 

.32 

.38 

.  72 

8.7 

.27 

.34 

1  .  29 

24.3 

.  27 

.  34 
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Table  2.  Estimates  of  p  and  A  and  the  case  weights  for  the  robust 
estimators.  BIE  =  bounded-influence  estimator.  RE  =  redescending 
estimator . 


ESTIMATES  CASE  WEIGHTS 


*1 

^2 

A 

W4 

W5 

W 1 2 

W  16 

W 19 

W25 

MLE 

3.29 

-7.00 

.31 

1 

1 

1 

1 

1 

1 

BIE : C=1 

3.49 

-8 . 00 

.  27 

1 

.71 

.  58 

1 

1 

1 

C  =  2 

3.59 

-8.54 

.20 

1 

.63 

.34 

1 

1 

1 

C  =  3 

3.66 

-8.87 

.  12 

1 

.62 

.  20 

1 

1 

99 

rr 

II 

u 

3.70 

-9.11 

.06 

.98 

.62 

.  13 

1 

1 

.98 

o 

ii 

Ol 

3.74 

-9.31 

.02 

.91 

.62 

.087 

1 

1 

.96 

o 

II 

-J 

3.84 

-9.71 

-.07 

.69 

.63 

.  041 

.97 

1 

.91 

C  =  20 

3.85 

-9.74 

-.08 

.65 

.64 

.038 

.92 

1 

.91 

RE:  C=1 

3.91 

-10.1 

-.21 

.  65 

.64 

0 

.92 

1 

.91 

c=2 

3.93 

-10.1 

-.23 

.52 

.71 

0 

.86 

.97 

.86 

c  =  3 

3.92 

-10.0 

-.23 

.46 

.75 

0 

.81 

.  94 

.  82 

o 

II 

3.91 

-9.97 

-.23 

.43 

.  76 

0 

.  78 

.  92 

81 

c=5 

4.11 

-10.7 

-  .36 

.006 

.77 

0 

.76 

.90 

.79 

c  =  10 

4.00 

-9.94 

-.34 

.  005 

.91 

0 

.64 

.71 

.61 
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Table  3:  Standard  errors  of  p  and  A  . 
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ESTIMATOR 


P 1  ^2 


MLE  (inverse  risher  .75  3.40  .21 

information  -  method  ( i ) ) 

(as  an  M-estimator  -  .66  3.46  .16 

method  ( i  i )  ) 


BIE: 

One-step  (c=l) 

.  54 

3.01 

.42 

c  =  10 

.61 

3.51 

.39 

RE: 

One  -step  ( c  =  1 ) 

.51 

2.79 

.35 

Fully  iterated  (c=10) 

.  43 

2.37 

.35 

ill  <t»'l  ill 


LIST  OF  FIGURES 


Figure  I:  Plot  of  returns  (or  recruits)  against  spawners  with  median 

recruitment  estimated  using  the  one-step  redescending  estimate.  Returns 
and  spawners  are  in  thousands  of  fish.  Selected  cases  are  identified 

A  l  /  2 

Figure  2 :  Index  plot  of  (LDJ  ,  the  square  root  of  the  approximate 

likelihood  distance 

Figure  3:  Difference  in  median  recruitment  estimated  with  and  without 
case  #12. 

Figure  4:  Residuals  from  the  one-step  redescending  estimator. 
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